Medical image segmentation model based on triple gate MultiLayer perceptron

To ease the tension between limited medical resources and growing medical needs, deep-learning-based medical image-assisted diagnosis has become a research focus in Wise Information Technology of med. Most existing medical segmentation models based on Convolution or Transformer achieve relatively sound results. However, the Convolution-based model, with its limited receptive field, cannot establish long-distance dependencies between features as the Network deepens. The Transformer-based model incurs a large computation overhead and can neither generalize the bias of local features nor perceive the position features of medical images, both of which are essential in medical image segmentation. To address these issues, we present Triple Gate MultiLayer Perceptron U-Net (TGMLP U-Net), a medical image segmentation model based on MLP, in which we design the Triple Gate MultiLayer Perceptron (TGMLP), composed of three parts. First, to encode the position information of features, we propose the Triple MLP module based on the MultiLayer Perceptron. It uses linear projection to encode features along the height, width, and channel dimensions, enabling the model to capture the long-distance dependence of features along the spatial dimension and the precise position information of features in three dimensions with less computational overhead. Then, we design the Local Priors and Global Perceptron module. The Global Perceptron divides the feature map into different partitions and conducts correlation modelling for each partition to establish the global dependency between partitions. The Local Priors uses multi-scale Convolution, which has a strong local feature extraction ability, to further explore the relationship of context feature information within the structure.
Finally, we propose a Gate-controlled Mechanism to address the problem that the dependence of position embeddings between Patches and within Patches cannot be learned well because medical image segmentation data contain relatively few samples. Experimental results indicate that the proposed model outperforms other state-of-the-art models on most evaluation metrics, demonstrating its excellent performance in segmenting medical images.

sub-partitions of the image, it only focuses on local features and loses the global context features, and it cannot establish long-term dependence between features because of its inherent bias when extracting features. Continuously stacking Convolution and down-sampling operations can enlarge the model's receptive field and enable Convolution to extract the interactive information between local features, but this makes the model more complex and prone to overfitting.
At present, some studies have modelled the long-distance dependence between features, such as the Attention mechanism [6] and Transformer [7]. However, the Attention mechanism is not directly applicable to medical image segmentation because it is highly dependent on external information and cannot capture the internal correlation of data or features. Therefore, the Attention mechanism still needs to be further improved. Recently, the Transformer architecture has been widely discussed in medical segmentation tasks. TransUNet [8], proposed by Chen et al., uses the Transformer to encode the feature map extracted by CNN and uses the extracted global context information to model remote dependency. The ViT [9]-based TransFuse [10], proposed by Zhang et al., combines Transformer and CNN to improve the efficiency of global context modelling without losing the ability to locate low-level details. Although the above Transformer-based models show great potential in medical image segmentation, they have the following shortcomings:
• Transformer enhances the model's ability to extract global features through multi-head Attention but does not add Local Priors. Without Local Priors to summarize feature biases between data, the model requires a large amount of training data to converge.
• Some medical images have fixed position priors, and the multi-head Attention in Transformer does not share parameters across all positions, limiting the use of position information. For example, the human liver, which lies below the heart and to the upper right of the stomach, needs to be segmented for detection. Convolution in Transformer cannot make full use of parameter sharing across the heart, stomach, and liver information for position perception.
Recently, Tolstikhin et al. proposed the MLP-Mixer model [11] based on MLP, which flattens feature maps along the channel axis and spatial axis and encodes them with fully connected layers so that the MLP-Mixer can model global context information, as shown in Fig. 1a. Although the MLP-Mixer is more efficient in modelling global context information than Convolutional Neural Networks and Transformer, it has the following weaknesses:
• The MLP-Mixer linearly projects and encodes spatial information along the spatial dimension, as shown in Fig. 1c, which destroys the feature structure in the original space dimension, resulting in the loss of position information carried by three-dimensional features and a quadratic increase in computation overhead.
• The MLP-Mixer uses full connections to replace Convolution, leading to the loss of spatial information of small-scale objects in feature maps and the lack of local prior features.
To address the above problems, we propose a new model for medical image segmentation in this paper: TGMLP U-Net. In this model, we design the TGMLP, as shown in Fig. 1b, composed of three parts. First, a TM module is proposed, consisting of three independent branches, each of which encodes along a specific dimension (height, width, or channel). It preserves the feature structure in the original space dimension of the input feature map, enables the model to generate long-distance dependencies, generates specific position information in three dimensions, and reduces the growth of computation overhead from quadratic, as caused by encoding along the spatial dimension, to linear. In addition, LGP is proposed. The Global Perceptron divides the feature map into different partitions.
It transmits them into multiple Fully Connected (FC) layers to share the parameters among different partitions, reducing the loss of small-scale feature information in segmenting medical images and modelling global context more effectively. Meanwhile, the Local Priors constructs CNN and Batch Normalization (BN) layers in parallel with the FC layers. CNN and BN are used to extract local features, avoiding the loss of local feature correlation caused by feature splitting. Finally, as labelled training images in medical DataSets are relatively rare and the cost of making labels is high, we design a Gate-controlled Mechanism in this paper. The Gate is a learnable parameter, making the proposed mechanism applicable to a DataSet of any size. Depending on the size of the DataSet, the Gate-controlled Mechanism learns whether the number of images is sufficient to learn the correct position embedding. Depending on whether the information learned by position embedding is useful, the Gate parameters will converge either to 0 or to a higher value. To sum up, the contributions of this paper are as follows:
• We design a TM module, a new structure to encode spatial feature information. It can encode spatial feature information along the height, width and channel dimensions, generate long-distance dependence, and perceive position sensitively while preserving the feature structure in the original space dimension of the input feature map with little computational overhead.
• We introduce the LGP module to extract local and global features in a complementary way, making the model better at perceiving and extracting features of small-scale and complex objects.
• We present the Gate-controlled Mechanism, which is applicable to DataSets of different sizes and makes it easier to learn the position bias of feature maps in medical image DataSets of different sizes.
• Experimental results show that the proposed model outperforms other state-of-the-art models on most evaluation metrics, demonstrating its outstanding performance in segmenting medical images.
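The quadratic-versus-linear argument made above for spatial versus axis-wise encoding can be illustrated by counting projection weights. The sketch below is only an illustration under stated assumptions: each projection is taken to be a single square dense layer without bias, and the function names are hypothetical rather than taken from the paper.

```python
def spatial_mixing_weights(height: int, width: int) -> int:
    """Weights of one dense layer that mixes all H*W flattened positions.

    Assumption: a single square (HW x HW) matrix, no bias, no hidden expansion.
    """
    tokens = height * width
    return tokens * tokens  # equals H^2 * W^2


def axis_mixing_weights(height: int, width: int, channels: int) -> int:
    """Weights of three independent dense layers, one per axis.

    Assumption: square matrices of size H x H, W x W and C x C, no bias.
    """
    return height * height + width * width + channels * channels


if __name__ == "__main__":
    # Hypothetical feature-map sizes, chosen only to show the growth trend.
    for side in (16, 32, 64):
        print(side, spatial_mixing_weights(side, side),
              axis_mixing_weights(side, side, 64))
```

Doubling the side length multiplies the spatial-mixing count by 16, while the axis-wise count grows far more slowly, which is the motivation for encoding each axis separately.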

Related work
In this section, we first summarize the typical Convolutional Neural Network-based methods in medical image segmentation. Then, we review the application of Transformer, especially in medical segmentation. Finally, we list the existing MultiLayer Perceptron (MLP) methods and compare them with the proposed model.
www.nature.com/scientificreports/
global and local features by using Convolution and Transformer to achieve a good segmentation effect. Different from the above models, the proposed TGMLP structure replaces the Transformer with an MLP structure, which reduces the computational complexity from quadratic in the input image size, as in the Transformer, to linear, and eliminates the influence of the position-perception limitations of multi-head Attention in Transformer. The TGMLP structure can also alternately exchange information between Patches and within Patches in the feature map to improve segmentation efficiency.
MultiLayer perceptron-based model. Compared with the single-layer perceptron, the MultiLayer Perceptron (MLP) adds a hidden layer to solve the problem of nonlinear separability. The neurons in each layer are fully connected, endowing MLP with good parallel processing, fault tolerance, adaptive learning, and memory abilities. In addition, MLP is mainly characterized by full connectivity, which differs from the local connectivity of the Convolution layer. With those advantages, MLP has gradually become a research focus. For example, MLP-Mixer can encode spatial information and performs better, but it requires a large-scale DataSet for training to achieve good results. The ResMLP [20] proposed by Touvron et al. uses flattened image blocks as input, adopts a linear layer to project the input, and then uses residual operations to update the projected features. Finally, the obtained feature blocks are classified after average pooling. The training of ResMLP is more stable than that of Transformer and CNN. The Attention-free network gMLP [21], recently proposed by Liu et al., controls the amount of information in feature maps only with Gating MLP, making the performance of MLP comparable to Transformer in language and visual tasks. However, these models flatten the three-dimensional spatial features of images into two dimensions for linear projection, which causes the loss of some position information and of feature information of small-scale objects, as well as the lack of local prior features. Different from the above research, the proposed TGMLP can encode three-dimensional spatial features based on MLP, together with an LGP module and a Gate-controlled Mechanism, making the model more sensitive to the position information in the feature map and capable of modelling long-term dependencies and encoding local prior features more accurately.
The model we explored is suitable for medical image segmentation, as the tissue structures of the body in medical images are often very different and complex, and the DataSet for segmentation is relatively small.

Methods
This section describes the proposed Triple Gate MultiLayer Perceptron U-Net (TGMLP U-Net) model for medical image segmentation. We first briefly introduce the basic architecture of the model; then, we describe its main components in detail: the Triple MLP (TM) structure, the Local and Global Perceptron (LGP), and the Gate-controlled Mechanism.
Triple gate multilayer perceptron U-net. TGMLP U-Net uses TGMLP as its basic block and adopts a Local-global training strategy. In the medical image segmentation task, the segmentation mask is larger than the Patch size, which limits the model's ability to learn the feature information and dependence of pixels between Patches. To help the model better understand the medical image, our model adopts a two-branch structure based on the Medical Transformer [17] proposed by Valanarasu et al. The two-branch structure includes a local branch and a global branch. The global branch mainly learns the long-distance feature relationship, and the local branch makes up for the local detail features lost between Patch pixels. At prediction time, TGMLP U-Net first stacks the features of all Patch blocks output by the local branch, then adds the feature maps extracted from the global and local branches, and finally uses a 1 × 1 Convolution layer to classify feature maps at the pixel level. Before entering the two-branch structure, the medical image goes through three Convolutional Layers for initial feature extraction, each followed by normalization and a ReLU activation function. The Encoders of the two branches are composed of a Convolution layer and the TGMLP structure. The Encoder gradually reduces and abstracts the image features, while the Decoder gradually restores the output of the Encoder to the original size, classifies pixel by pixel, and obtains the segmented image.
In the Encoder of TGMLP U-Net, TGMLP encodes feature maps along the height axis, width axis and channel axis, respectively. The Global Perceptron and the Local Priors are then incorporated into the TGMLP structure, which enables the model both to model the global context information of the feature map and establish the external dependencies between global features, and to better extract the local information of the feature map. Finally, TGMLP adds a Gate-controlled Mechanism to control the output information and retain the feature information in the feature map to the maximum extent. The encoded features output by TGMLP are passed to a 1 × 1 Convolution layer, and the convolved features are combined with a residual mapping: the add function adds the features after Convolution to the features input into TGMLP to obtain the final encoded feature map. Figure 2b shows the architecture of TGMLP and Convolution in the Encoder. The Decoder comprises a 3 × 3 Convolution layer, Deconvolution and a skip connection. The Convolution in the Decoder reduces the number of channels in the feature map, while the Deconvolution increases the size of the feature map in turn. The Deconvolution results in the decoding part and the corresponding outputs of the encoding part are combined by skip connections, recovering feature information gradually. Figure 2c shows an overview of the Decoder. There are two Encoders and two Decoders in the global branch of the TGMLP U-Net, and five Encoders and five Decoders in the local branch. The overall architecture of TGMLP U-Net is shown in Fig. 2a.
Triple MultiLayer perceptron. As Fig. 1a shows, the recent MLP-Mixer is composed of two different MLP blocks: Channel MLP and Spatial MLP, responsible for encoding channel information and spatial information, respectively. Through Formulas (1) and (2), MLP-Mixer calculates the global affinity and models the global context information in a long-distance dependency way. Different from Convolution, MLP-Mixer can capture non-local information from the entire feature mapping. The spatial MLP in MLP-Mixer needs to calculate the spatial relationship between Patches, whose computational complexity is $O(H^2 W^2)$, making MLP-Mixer unsuitable for segmenting medical images with high density and high resolution. In addition, different from Convolution, when calculating local feature information, MLP-Mixer uses the spatial MLP to encode two-dimensional spatial feature information, which is not conducive to obtaining the position information. However, position information is crucial in medical image segmentation and is often used to help extract the structure of the segmented object. For efficient and accurate modelling, the proposed TGMLP projects along the feature map's height axis, width axis and channel axis, as shown in Fig. 1b. This reduces the computation complexity of TGMLP to $O(HWC)$. It also makes the model more sensitive to the position information, so that it can capture remote interactions with accurate position information and encode the spatial structure of medical images. Therefore, for a given input feature mapping $x \in \mathbb{R}^{C_{in} \times H \times W}$ with height $H$, width $W$, and $C_{in}$ channels, the TGMLP output $S_i$ of the $i$-th layer along the height, width and channel axes is expressed as in Formulas (3)-(5). The TMLP structure is shown in Fig. 3.

The Global Perceptron divides the feature map into different partitions and makes them share parameters.
We divide the partitions as follows. First, the input feature map $x \in \mathbb{R}^{N \times C \times H \times W}$ is divided into $h$ partitions; the feature map is reshaped to $(h^2 N, C, \frac{W}{h}, \frac{H}{h})$; the axes are then reordered, and the size of the feature map becomes $(N, \frac{W}{h}, \frac{H}{h}, C, h, h)$, as shown in Formula (6).
Here, RS denotes a function that changes the shape of the tensor without changing the order of data in memory, and Permute reorders the axes of the feature map. Then the global average pooling operation is used to get a matrix of size $(N, \frac{W}{h}, \frac{H}{h}, C, 1, 1)$, which is input into BN and a two-layer MLP to get a weight matrix of size $(N, \frac{W}{h} \times \frac{H}{h} \times C)$, as shown in Formula (7).
Here, GAP denotes global average pooling, W represents the Convolution kernel, and MLP represents the two-layer MLP. To achieve correlation between different partitions of the same channel, we first reshape the weight matrix to $(N, \frac{W}{h}, \frac{H}{h}, C, 1, 1)$; then we use the expand function in PyTorch to broadcast it to $(N, \frac{W}{h}, \frac{H}{h}, C, h, h)$; finally, we use the add function to add the weight matrix to each partition to get the feature map $M_{out}$ of size $(N, \frac{W}{h}, \frac{H}{h}, C, h, h)$, as shown in Formula (8). The Global Perceptron realizes the correlation between each pixel and different partitions, offsetting the loss of small-scale objects in feature extraction.

The Local Priors module first reshapes the tensor output by the Global Perceptron into $(N, H, W, C)$ and constructs four parallel Convolutional Layers, each followed by a BN layer. The reshaped tensor is then input into the four parallel Convolutional Layers, which address the loss of local structural features in feature extraction. The kernel sizes of the four Convolutional Layers are 1, 3, 5 and 7, respectively, and padding ($p = 0, 1, 2, 3$) is used to preserve the resolution. Finally, the outputs of all Convolution branches are added to those of TMLP to obtain the final output. The calculation of Local Priors is shown in Formulas (9) and (10).
Here, F represents the kernel sizes of the four Convolutional Layers, which are 1, 3, 5 and 7, respectively; P represents the number of pixels that each Convolutional layer pads, which are 0, 1, 2 and 3, respectively; and $S_i$ is the value of Formula (5).
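The two LGP steps described above can be sketched on a single-channel map in plain Python. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, the partition means are used directly as stand-in weights (in the text they first pass through BN and a two-layer MLP), and the padding check relies on the standard stride-1 convolution output-size formula.

```python
def split_into_partitions(fmap, h):
    """Split an H x W map (list of rows) into a grid of h x h blocks."""
    rows, cols = len(fmap), len(fmap[0])
    if rows % h or cols % h:
        raise ValueError("feature map size must be divisible by h")
    return [
        [[row[bj * h:(bj + 1) * h] for row in fmap[bi * h:(bi + 1) * h]]
         for bj in range(cols // h)]
        for bi in range(rows // h)
    ]


def partition_means(parts):
    """Global average pooling inside each partition (one value per block)."""
    return [[sum(sum(r) for r in block) / (len(block) * len(block[0]))
             for block in row] for row in parts]


def add_partition_weights(parts, weights):
    """Broadcast one weight per partition over its h x h pixels and add it."""
    return [[[[v + weights[bi][bj] for v in r] for r in block]
             for bj, block in enumerate(row)]
            for bi, row in enumerate(parts)]


def conv_output_size(size, kernel, padding, stride=1):
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


if __name__ == "__main__":
    fmap = [[4 * r + c for c in range(4)] for r in range(4)]
    parts = split_into_partitions(fmap, 2)
    weights = partition_means(parts)  # stand-in for BN + two-layer MLP
    print(add_partition_weights(parts, weights)[0][0])
    # Kernel sizes 1, 3, 5, 7 with paddings 0, 1, 2, 3 keep a 128-pixel axis.
    print([conv_output_size(128, k, p)
           for k, p in zip((1, 3, 5, 7), (0, 1, 2, 3))])
```

The padding values listed in the text equal (kernel - 1) / 2, which is why all four branches return maps of the same resolution and can be summed.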
Gate-controlled mechanism. We have discussed the benefits of applying TMLP and LGP in medical image segmentation. They enable TGMLP to calculate global context feature information with good computation efficiency and encode remote interactions within the input feature mapping. However, TGMLP is more likely to learn position bias when experiments are conducted on a large medical DataSet. Position bias is difficult to learn on small-scale medical image DataSets, so the encoded remote interactive position information is inaccurate. When the learnt position bias is not accurate enough, TGMLP cannot bring the performance of TMLP into full play. Therefore, we explore a TMLP with the Gate-controlled Mechanism to control the influence of position bias on local position perception. After modifying TMLP, we apply the Gate-controlled Mechanism to the height axis of the TMLP, as expressed by Formula (11). The Gate-controlled Mechanism is applied to the width axis and channel axis of TMLP in the same way as in Formula (11).
Here, the gates added to Formulas (3), (4), and (5) of TMLP are $G_C, G_H, G_W \in \mathbb{R}$, respectively. They are learnable parameters that jointly construct the Gate-controlled Mechanism. Usually, if the model accurately learns the position-coded information, the Gate-controlled Mechanism will assign high position weights to the axes of the TMLP.
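As a rough illustration of how axis-wise projection and scalar gates could fit together, the sketch below projects a small tensor indexed as x[c][h][w] along each axis and sums the gated branches. Formulas (3)-(5) and (11) are not reproduced in this text, so the way the gates multiply each branch and the way the branches are summed are assumptions, as are all function names.

```python
def project_height(x, w_h):
    """out[c][i][w] = sum_j w_h[i][j] * x[c][j][w] (mixes along the height axis)."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    return [[[sum(w_h[i][j] * x[c][j][w] for j in range(H)) for w in range(W)]
             for i in range(H)] for c in range(C)]


def project_width(x, w_w):
    """out[c][h][i] = sum_j w_w[i][j] * x[c][h][j] (mixes along the width axis)."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    return [[[sum(w_w[i][j] * x[c][h][j] for j in range(W)) for i in range(W)]
             for h in range(H)] for c in range(C)]


def project_channel(x, w_c):
    """out[i][h][w] = sum_j w_c[i][j] * x[j][h][w] (mixes along the channel axis)."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    return [[[sum(w_c[i][j] * x[j][h][w] for j in range(C)) for w in range(W)]
             for h in range(H)] for i in range(C)]


def gated_triple_projection(x, w_h, w_w, w_c, g_h, g_w, g_c):
    """Sum of the three branches, each scaled by its scalar gate (assumed form)."""
    s_h, s_w, s_c = project_height(x, w_h), project_width(x, w_w), project_channel(x, w_c)
    C, H, W = len(x), len(x[0]), len(x[0][0])
    return [[[g_h * s_h[c][h][w] + g_w * s_w[c][h][w] + g_c * s_c[c][h][w]
              for w in range(W)] for h in range(H)] for c in range(C)]


def identity(n):
    """n x n identity matrix, used here only as a placeholder weight."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    x = [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]  # C=2, H=2, W=3
    swap_rows = [[0, 1], [1, 0]]
    # With only the height gate open, the output is the height branch alone.
    print(gated_triple_projection(x, swap_rows, identity(3), identity(2), 1, 0, 0))
```

A gate that converges towards 0 removes its branch from the sum, which matches the behaviour described for DataSets too small to learn a reliable position bias; the output keeps the input shape in every case.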

Experiments and analysis
In  Table 5. After calculating the Loss Function, we notice that the Cross-Entropy Loss Function can achieve the optimal results. Therefore, it is adopted in this paper, as shown in Formula (12).
(9) $U_{out} = \mathrm{RS}(M_{out}, (N, W, H, C))$

In Formula (12), the larger the value of α is, the larger the Loss Value contributed by the object to be segmented will be. γ is the regulating factor that reduces the weight of the background, making the model focus on segmenting the objects in medical images. In our experiments, we set α to 0.25 and γ to 2.
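Formula (12) is not reproduced in this text. The roles described for α and γ resemble those in a focal-style weighting of the binary cross-entropy, so the sketch below assumes that form purely for illustration; it may differ from the loss actually used, and the function names are hypothetical.

```python
import math


def focal_loss(prob_fg, target, alpha=0.25, gamma=2.0, eps=1e-7):
    """Per-pixel loss for a predicted foreground probability (assumed focal form).

    alpha weights foreground pixels; gamma down-weights well-classified pixels.
    """
    p = min(max(prob_fg, eps), 1.0 - eps)  # clamp to keep log() finite
    if target == 1:
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)


def mean_focal_loss(probs, targets, alpha=0.25, gamma=2.0):
    """Average of the per-pixel losses over a flattened mask."""
    return sum(focal_loss(p, t, alpha, gamma)
               for p, t in zip(probs, targets)) / len(probs)


if __name__ == "__main__":
    # A confident correct foreground pixel costs far less than a confident miss.
    print(focal_loss(0.9, 1), focal_loss(0.1, 1))
    print(mean_focal_loss([0.9, 0.2, 0.6], [1, 0, 1]))
```

With gamma set to 0 the modulating factor disappears and the expression reduces to an alpha-weighted cross-entropy, which is one way to relate the two loss names.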
Experimental parameters. The experimental configuration is as follows: graphics card, NVIDIA GeForce RTX 3070Ti; processor, Intel i7-8700 CPU; memory, 16 GB; software platform, Windows 10, PyTorch 1.8 and Python 3.6. We select the group of experimental parameters with the best results by repeatedly adjusting them. The input image size is fixed at 128 × 128, and the model is trained for 400 epochs. The optimizer is Adam with a batch size of 10. The initial learning rate is set to 0.001, and the minimum learning rate is set to 0.00001. The learning rate decays by cosine annealing, with 100 warmup epochs. To speed up the model's convergence, we apply batch normalization to each Convolution layer with epsilon set to 0.000001 and momentum set to 0.1. To ensure a fair comparison, we use the same training settings for all the models involved in this experiment.
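The learning-rate settings listed above can be read as the following schedule. This is a sketch under assumptions: the shape of the warmup is not stated in the text, so a linear ramp is assumed, and the function name is hypothetical.

```python
import math


def learning_rate(epoch, base_lr=1e-3, min_lr=1e-5, warmup_epochs=100, total_epochs=400):
    """Learning rate at a given epoch: linear warmup (assumed), then cosine annealing."""
    if epoch < warmup_epochs:
        # Ramp up so that the last warmup epoch reaches base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for epoch in (0, 50, 99, 100, 250, 400):
        print(epoch, learning_rate(epoch))
```

The rate peaks at 0.001 when warmup ends and decays smoothly to the stated minimum of 0.00001 by epoch 400.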
Experimental results. COVID-19 DataSet. As shown in Table 1 Table 3, demonstrating that TGMLP U-Net outperforms the advanced DS-TransUNet-L [29] and the latest FANet [30]. The values of F1, mIoU, Recall and Precision of TGMLP U-Net increase to 98.21%, 96.57%, 97.25% and 95.28%, respectively, which are 3.99%, 7.18%, 2.25% and 1.59% higher than those of DS-TransUNet-L. Although DoubleU-Net performs well in Precision, our model produces the best overall performance. As shown in Fig. 5, the visualized results show that our model can accurately predict the position and boundary of colonic polyps and better segment them from normal skin.
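For reference, the metrics reported above can be computed from a predicted and a ground-truth binary mask as sketched below. This uses the standard definitions for a single foreground class; the paper's mIoU averages over classes, which this sketch does not do, and the function name is hypothetical.

```python
def segmentation_scores(pred, truth):
    """Precision, recall, F1 and foreground IoU for flattened binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


if __name__ == "__main__":
    pred = [1, 1, 0, 0, 1]
    truth = [1, 0, 0, 1, 1]
    print(segmentation_scores(pred, truth))
```

F1 for a binary mask is the same quantity as the Dice coefficient, so it is always at least as large as the IoU of the same prediction.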
MoNuSeg DataSet. We compare our model with other general segmentation models on a small-scale medical DataSet to increase our model's credibility. MoNuSeg is a DataSet with only 30 training samples. The experi-

Fig. 7, the integration of TM enables the model to segment medical images in COVID-19 with precise positions. The main factor behind this performance improvement is that TM encodes along the length, width and height dimensions, making the model generate long-distance dependence between features and

The effectiveness of LGP. We verify the importance of LGP. Compared with line 2, line 3 of Table 6 shows that the values of mDice, SEN, SPC and DSC increase by 2.2%, 2% and 4.3%, while HD decreases by 8.4%, respectively. Case 2 in Fig. 7 shows that LGP can extract local and global features in a complementary way. The segmented image contains relatively complete salient objects and preserves the details of the image.
The effectiveness of the Gate-controlled Mechanism. We also validate the importance of the Gate-controlled Mechanism. Line 4 of Table 6 shows that the Gate-controlled Mechanism can improve segmentation efficiency. As shown in case 3 of Fig. 7, although LGP and TM can bring pixels very close to the segmentation mask, our model learns the dependence between pixel position features better and makes LGP and TM find more qualitative features, because our method considers the dependence between pixel positions encoded by the Gate-controlled Mechanism.

Conclusion
Triple Gate MultiLayer Perceptron U-Net (TGMLP U-Net), a medical image segmentation model, is proposed in this work, which can segment medical images precisely with less computation overhead. Its performance is attributed to three modules: TM, LGP, and a Gate-controlled Mechanism. The TM module encodes coarse-grained and fine-grained feature information along the height, width, and channel axes, establishing the long-distance dependence between features and outputting sensitive position feature information, which significantly improves the model's performance. In the LGP module, the Global Perceptron can be regarded as a sparse full connection with shared parameters, and Local Priors can be regarded as multi-scale Convolution. They can adapt to changes in dynamic features and extract local and global features in a complementary way, enabling the model to segment complex medical images. The Gate-controlled Mechanism enables the model to dynamically adapt to DataSets of different sizes, learn the position embedding dependencies between and within Patches, and become more sensitive to position information. We use three classical medical image segmentation DataSets to verify the performance of TGMLP U-Net, and the results demonstrate its excellent performance in segmenting medical images. However, our work still has much room for improvement. For example, whether TGMLP U-Net is suitable for other segmentation tasks, or whether it can be reparametrized to accelerate segmentation, has not been discussed in detail. In the future, we will continue to study the application of TGMLP U-Net in other image processing tasks, such as denoising, object detection and image super-resolution, to make it applicable to a broader range of computer vision tasks.

Table 6. Ablation experiment of TGMLP U-Net model. The best results are shown in bold.